# install.packages("fpp3")
library(fpp3)
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggtime)
library(patchwork)
library(DT)

Project 2 - Data Cubing and Binning
For Project 2 we further processed our dataset using binning and a data cube structure built on hierarchies.
1. Introduction
Analyzing the State_time_series.csv dataset, which contains granular real estate metrics over time, benefits significantly from data transformation and aggregation. Binning is essential for this process, as it converts continuous variables, such as MedianListingPrice_AllHomes, into discrete, manageable categories (e.g., ‘$150k-$200k’, ‘$200k-$250k’). This discretization simplifies complex data, making it easier to summarize, visualize, and identify trends.
Furthermore, leveraging a data cube structure built on hierarchies allows for powerful, multidimensional analysis. For instance, the Date field isn’t just a single point in time; it’s part of a hierarchy that can be “rolled up” from a specific day to a Month, Quarter, or Year. Similarly, the RegionName (State) could be aggregated into broader geographical regions (e.g., “Northeast”, “West Coast”). By combining these binned and hierarchical dimensions, we can quickly “slice and dice” the data to answer complex questions, such as “How many homes in the $200k-$250k price bin were available in the Northeast region during Q3 2018?” This turns a massive, raw dataset into a flexible tool for gaining actionable insights.
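As a toy illustration of these operations (hypothetical numbers; the real dataset is loaded in Section 2), grouping on the binned and hierarchical dimensions and then filtering is exactly a dice-then-slice query:

```r
library(dplyr)

# Hypothetical pre-aggregated listing counts (illustration only).
toy <- tibble(
  Region    = c("Northeast", "Northeast", "West Coast"),
  Quarter   = c("2018 Q3", "2018 Q3", "2018 Q3"),
  Price_Bin = c("$200k-$250k", "$200k-$250k", "$150k-$200k"),
  Listings  = c(120, 80, 95)
)

toy %>%
  group_by(Region, Quarter, Price_Bin) %>%                   # dice on three dimensions
  summarise(Listings = sum(Listings), .groups = "drop") %>%
  filter(Region == "Northeast", Price_Bin == "$200k-$250k")  # slice one cell
# -> one row: Northeast, 2018 Q3, $200k-$250k, 200
```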
The Zillow Home Value Index or ZHVI is a smoothed, seasonally adjusted measure of the typical home value and market changes across a given region and housing type. It reflects the typical value for homes in the 35th to 65th percentile range.
The dataset, Zillow Economics Data (which includes the ZHVI series), can be downloaded with:
```{bash}
#!/bin/bash
curl -L -o ~/Downloads/zecon.zip \
  https://www.kaggle.com/api/v1/datasets/download/zillow/zecon
```

1.1 Load packages
Notes on possible messages when attaching the packages:

```
Attaching package: ‘ggtime’
The following objects are masked from ‘package:feasts’:
gg_arma, gg_irf, gg_lag, gg_season, gg_subseries, gg_tsdisplay,
gg_tsresiduals
```

2. Load the Dataset
The following code locates the repository root, then returns the variable datasets_path, which is used to build the path to the data file to be loaded.
find_git_root <- function(start = getwd()) {
  path <- normalizePath(start, winslash = "/", mustWork = TRUE)
  while (path != dirname(path)) {
    if (dir.exists(file.path(path, ".git"))) return(path)
    path <- dirname(path)
  }
  stop("No .git directory found — are you inside a Git repository?")
}
repo_root <- find_git_root()
datasets_path <- file.path(repo_root, "datasets")
zillow_economics_data_path <- file.path(datasets_path, "zillow-economics-data-01")
state_time_series <- file.path(zillow_economics_data_path, "State_time_series.csv")
all_states_data <- read.csv(state_time_series)

3. Data Exploration and Processing
The ‘Date’ column is a character and should be converted to a Date object.
all_states_data$Date <- as.Date(all_states_data$Date)
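A slightly more defensive variant is to pass the format explicitly: as.Date() errors on non-standard strings, but with a format it returns NA for non-matching entries, which can then be counted. A self-contained sketch with a hypothetical malformed value:

```r
# One well-formed ISO date and one malformed string (hypothetical example).
dates_chr <- c("1996-04-30", "not-a-date")
dates <- as.Date(dates_chr, format = "%Y-%m-%d")

dates               # the malformed entry parses to NA instead of erroring
sum(is.na(dates))   # -> 1 failed conversion; worth checking after any import
```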
# str(all_states_data)

Exploring the NAs and the structure of the dataset:
# Get the total number of rows for calculating percentages
total_rows <- nrow(all_states_data)
# Create a pretty summary table
na_summary <- all_states_data %>%
  # 1. Count NAs for every column
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  # 2. Pivot the data from wide to long
  pivot_longer(everything(),
               names_to = "Column",
               values_to = "NA_Count") %>%
  # 3. (Optional) Filter to only show columns that HAVE NAs
  filter(NA_Count > 0) %>%
  # 4. (Optional) Add a percentage column
  mutate(NA_Percentage = (NA_Count / total_rows) * 100) %>%
  # 5. Sort by the highest NA count
  arrange(desc(NA_Count))
# Print the clean table
print(na_summary)

# A tibble: 80 × 3
Column NA_Count NA_Percentage
<chr> <int> <dbl>
1 PctOfHomesSellingForGain_AllHomes 12609 95.4
2 PctOfHomesSellingForLoss_AllHomes 12609 95.4
3 MedianRentalPrice_5BedroomOrMore 11994 90.8
4 MedianRentalPricePerSqft_5BedroomOrMore 11752 88.9
5 MedianRentalPricePerSqft_Studio 10875 82.3
6 MedianRentalPrice_CondoCoop 10437 79.0
7 MedianRentalPricePerSqft_DuplexTriplex 10293 77.9
8 MedianRentalPrice_Studio 10211 77.3
9 MedianListingPrice_1Bedroom 10205 77.2
10 MedianRentalPrice_DuplexTriplex 10068 76.2
# ℹ 70 more rows
A sample of the dataset, shown as a data table to make clear what the dataset looks like and which features are available.
print("all_states_data (data frame):")

[1] "all_states_data (data frame):"
DT::datatable(all_states_data, options = list(pageLength = 5))

Warning in instance$preRenderHook(instance): It seems your data is too big for
client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html
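One lightweight response to this warning, short of setting up server-side processing, is to send only a sample of rows to the client (a sketch; assumes the all_states_data object from above):

```r
# Render only the first 100 rows instead of the full 13,212 x 82 table,
# keeping the client-side DataTable responsive.
DT::datatable(head(all_states_data, 100), options = list(pageLength = 5))
```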
library(skimr)
skim(all_states_data)

| Name | all_states_data |
|---|---|
| Number of rows | 13212 |
| Number of columns | 82 |
| Column type frequency: | |
| character | 1 |
| Date | 1 |
| numeric | 80 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| RegionName | 0 | 1 | 4 | 18 | 0 | 52 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| Date | 0 | 1 | 1996-04-30 | 2017-12-31 | 2007-05-31 | 261 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| DaysOnZillow_AllHomes | 8367 | 0.37 | 110.12 | 27.47 | 49.25 | 90.25 | 108.50 | 126.75 | 251.62 | ▃▇▃▁▁ |
| InventorySeasonallyAdjusted_AllHomes | 8316 | 0.37 | 33292.66 | 34926.70 | 972.00 | 9828.50 | 21712.50 | 47453.00 | 260687.00 | ▇▂▁▁▁ |
| InventoryRaw_AllHomes | 8316 | 0.37 | 33299.01 | 35014.16 | 911.00 | 9756.25 | 21289.00 | 46891.00 | 268055.00 | ▇▂▁▁▁ |
| MedianListingPricePerSqft_1Bedroom | 9626 | 0.27 | 182.47 | 99.81 | 57.14 | 125.69 | 162.75 | 202.63 | 627.55 | ▇▅▁▁▁ |
| MedianListingPricePerSqft_2Bedroom | 8678 | 0.34 | 135.49 | 74.25 | 60.00 | 92.16 | 121.30 | 152.24 | 550.64 | ▇▂▁▁▁ |
| MedianListingPricePerSqft_3Bedroom | 8605 | 0.35 | 129.45 | 58.84 | 56.48 | 93.28 | 116.63 | 143.19 | 460.47 | ▇▃▁▁▁ |
| MedianListingPricePerSqft_4Bedroom | 8535 | 0.35 | 133.46 | 58.34 | 61.80 | 99.60 | 119.82 | 144.75 | 480.77 | ▇▂▁▁▁ |
| MedianListingPricePerSqft_5BedroomOrMore | 8643 | 0.35 | 135.67 | 64.92 | 63.78 | 99.08 | 119.75 | 149.35 | 617.96 | ▇▁▁▁▁ |
| MedianListingPricePerSqft_AllHomes | 8538 | 0.35 | 136.66 | 68.76 | 62.14 | 96.01 | 120.58 | 153.44 | 520.72 | ▇▂▁▁▁ |
| MedianListingPricePerSqft_CondoCoop | 9063 | 0.31 | 163.55 | 99.42 | 61.92 | 113.08 | 141.00 | 177.07 | 1000.00 | ▇▁▁▁▁ |
| MedianListingPricePerSqft_DuplexTriplex | 9248 | 0.30 | 97.09 | 57.02 | 32.14 | 60.89 | 81.21 | 113.52 | 446.43 | ▇▂▁▁▁ |
| MedianListingPricePerSqft_SingleFamilyResidence | 8573 | 0.35 | 133.37 | 60.79 | 63.27 | 95.55 | 120.09 | 149.84 | 475.36 | ▇▂▁▁▁ |
| MedianListingPrice_1Bedroom | 10205 | 0.23 | 147083.00 | 67877.11 | 49900.00 | 99000.00 | 130000.00 | 169900.00 | 399000.00 | ▇▇▂▁▁ |
| MedianListingPrice_2Bedroom | 8839 | 0.33 | 158872.73 | 80629.40 | 57000.00 | 109500.00 | 147000.00 | 179900.00 | 599000.00 | ▇▃▁▁▁ |
| MedianListingPrice_3Bedroom | 8842 | 0.33 | 209225.76 | 89233.54 | 109900.00 | 149000.00 | 189900.00 | 240000.00 | 687000.00 | ▇▃▁▁▁ |
| MedianListingPrice_4Bedroom | 8876 | 0.33 | 310849.76 | 115574.82 | 169000.00 | 238744.50 | 283020.00 | 339900.00 | 950000.00 | ▇▃▁▁▁ |
| MedianListingPrice_5BedroomOrMore | 8989 | 0.32 | 416375.41 | 191106.68 | 159900.00 | 310000.00 | 369700.00 | 452172.50 | 1847500.00 | ▇▂▁▁▁ |
| MedianListingPrice_AllHomes | 8966 | 0.32 | 223378.49 | 84461.60 | 112944.00 | 159900.00 | 209000.00 | 259900.00 | 610000.00 | ▇▆▁▁▁ |
| MedianListingPrice_CondoCoop | 9402 | 0.29 | 202333.42 | 79067.15 | 82500.00 | 152362.50 | 184900.00 | 228000.00 | 754500.00 | ▇▃▁▁▁ |
| MedianListingPrice_DuplexTriplex | 9323 | 0.29 | 207474.74 | 119710.56 | 64900.00 | 129900.00 | 178900.00 | 245000.00 | 939000.00 | ▇▂▁▁▁ |
| MedianListingPrice_SingleFamilyResidence | 9082 | 0.31 | 228169.61 | 96836.13 | 112900.00 | 159900.00 | 209900.00 | 265000.00 | 725000.00 | ▇▅▁▁▁ |
| MedianPctOfPriceReduction_AllHomes | 8724 | 0.34 | 3.85 | 0.82 | 1.74 | 3.24 | 3.72 | 4.35 | 8.34 | ▂▇▃▁▁ |
| MedianPctOfPriceReduction_CondoCoop | 9340 | 0.29 | 4.02 | 1.06 | 1.68 | 3.26 | 3.81 | 4.56 | 10.00 | ▅▇▂▁▁ |
| MedianPctOfPriceReduction_SingleFamilyResidence | 8724 | 0.34 | 3.85 | 0.82 | 1.72 | 3.23 | 3.74 | 4.35 | 8.35 | ▂▇▃▁▁ |
| MedianPriceCutDollar_AllHomes | 8724 | 0.34 | 8034.11 | 3071.98 | 5000.00 | 5100.00 | 7500.00 | 10000.00 | 24000.00 | ▇▆▁▁▁ |
| MedianPriceCutDollar_CondoCoop | 9340 | 0.29 | 7453.13 | 2890.12 | 2050.00 | 5000.00 | 6800.00 | 10000.00 | 27753.50 | ▇▆▁▁▁ |
| MedianPriceCutDollar_SingleFamilyResidence | 8724 | 0.34 | 8244.97 | 3547.92 | 5000.00 | 5300.00 | 7900.00 | 10000.00 | 26000.00 | ▇▅▁▁▁ |
| MedianRentalPricePerSqft_1Bedroom | 9588 | 0.27 | 1.40 | 0.57 | 0.72 | 0.99 | 1.21 | 1.61 | 3.37 | ▇▅▂▁▁ |
| MedianRentalPricePerSqft_2Bedroom | 9065 | 0.31 | 1.11 | 0.45 | 0.52 | 0.81 | 0.95 | 1.26 | 3.11 | ▇▅▁▁▁ |
| MedianRentalPricePerSqft_3Bedroom | 8985 | 0.32 | 0.97 | 0.32 | 0.53 | 0.75 | 0.86 | 1.07 | 2.45 | ▇▃▁▁▁ |
| MedianRentalPricePerSqft_4Bedroom | 9808 | 0.26 | 0.85 | 0.25 | 0.49 | 0.68 | 0.78 | 0.94 | 2.13 | ▇▅▁▁▁ |
| MedianRentalPricePerSqft_5BedroomOrMore | 11752 | 0.11 | 0.76 | 0.16 | 0.36 | 0.63 | 0.74 | 0.89 | 1.27 | ▁▇▆▃▁ |
| MedianRentalPricePerSqft_AllHomes | 8864 | 0.33 | 1.01 | 0.42 | 0.58 | 0.76 | 0.88 | 1.12 | 3.15 | ▇▂▁▁▁ |
| MedianRentalPricePerSqft_CondoCoop | 10004 | 0.24 | 1.30 | 0.68 | 0.64 | 0.90 | 1.07 | 1.41 | 4.82 | ▇▁▁▁▁ |
| MedianRentalPricePerSqft_DuplexTriplex | 10293 | 0.22 | 1.07 | 0.56 | 0.50 | 0.74 | 0.87 | 1.12 | 3.86 | ▇▁▁▁▁ |
| MedianRentalPricePerSqft_MultiFamilyResidence5PlusUnits | 9189 | 0.30 | 1.17 | 0.50 | 0.59 | 0.84 | 1.00 | 1.30 | 3.38 | ▇▃▁▁▁ |
| MedianRentalPricePerSqft_SingleFamilyResidence | 8923 | 0.32 | 0.94 | 0.28 | 0.58 | 0.75 | 0.86 | 1.03 | 2.44 | ▇▃▁▁▁ |
| MedianRentalPricePerSqft_Studio | 10875 | 0.18 | 1.18 | 0.74 | 0.57 | 0.74 | 0.85 | 1.26 | 3.98 | ▇▂▁▁▁ |
| MedianRentalPrice_1Bedroom | 9686 | 0.27 | 978.56 | 431.03 | 495.00 | 650.00 | 860.00 | 1195.00 | 2690.00 | ▇▃▁▁▁ |
| MedianRentalPrice_2Bedroom | 9168 | 0.31 | 1097.62 | 473.15 | 575.00 | 775.00 | 925.00 | 1279.00 | 3215.00 | ▇▃▁▁▁ |
| MedianRentalPrice_3Bedroom | 9075 | 0.31 | 1356.52 | 465.58 | 750.00 | 1050.00 | 1200.00 | 1515.00 | 3550.00 | ▇▃▁▁▁ |
| MedianRentalPrice_4Bedroom | 9856 | 0.25 | 1722.48 | 486.16 | 950.00 | 1395.00 | 1595.00 | 1950.00 | 3850.00 | ▇▆▂▁▁ |
| MedianRentalPrice_5BedroomOrMore | 11994 | 0.09 | 2139.04 | 698.33 | 795.00 | 1750.00 | 1995.00 | 2495.00 | 4500.00 | ▂▇▃▁▁ |
| MedianRentalPrice_AllHomes | 9060 | 0.31 | 1362.29 | 455.57 | 750.00 | 1050.00 | 1200.00 | 1590.00 | 3600.00 | ▇▃▁▁▁ |
| MedianRentalPrice_CondoCoop | 10437 | 0.21 | 1410.25 | 480.29 | 697.50 | 1050.00 | 1295.00 | 1595.00 | 3200.00 | ▇▇▂▂▁ |
| MedianRentalPrice_DuplexTriplex | 10068 | 0.24 | 963.56 | 434.44 | 500.00 | 685.00 | 800.00 | 1100.00 | 2895.00 | ▇▂▁▁▁ |
| MedianRentalPrice_MultiFamilyResidence5PlusUnits | 9029 | 0.32 | 1091.13 | 449.92 | 550.00 | 750.00 | 950.00 | 1296.00 | 2895.00 | ▇▃▁▁▁ |
| MedianRentalPrice_SingleFamilyResidence | 9120 | 0.31 | 1362.84 | 442.51 | 750.00 | 1050.00 | 1205.00 | 1582.25 | 3400.00 | ▇▅▁▁▁ |
| MedianRentalPrice_Studio | 10211 | 0.23 | 1220.91 | 353.50 | 490.00 | 975.00 | 1150.00 | 1400.00 | 2500.00 | ▂▇▅▁▁ |
| ZHVIPerSqft_AllHomes | 620 | 0.95 | 116.42 | 62.91 | 35.00 | 77.00 | 98.00 | 141.00 | 499.00 | ▇▃▁▁▁ |
| PctOfHomesDecreasingInValues_AllHomes | 4292 | 0.68 | 33.29 | 22.47 | 0.18 | 16.98 | 27.45 | 45.08 | 99.38 | ▇▇▃▂▁ |
| PctOfHomesIncreasingInValues_AllHomes | 4292 | 0.68 | 59.03 | 23.77 | 0.47 | 44.22 | 63.54 | 76.30 | 99.76 | ▂▃▅▇▅ |
| PctOfHomesSellingForGain_AllHomes | 12609 | 0.05 | 85.88 | 11.99 | 50.99 | 79.09 | 89.97 | 95.63 | 100.00 | ▁▁▂▃▇ |
| PctOfHomesSellingForLoss_AllHomes | 12609 | 0.05 | 14.12 | 11.99 | 0.00 | 4.36 | 10.03 | 20.91 | 49.01 | ▇▃▂▁▁ |
| PctOfListingsWithPriceReductionsSeasAdj_AllHomes | 8724 | 0.34 | 12.38 | 2.22 | 4.51 | 10.91 | 12.31 | 13.92 | 20.64 | ▁▃▇▃▁ |
| PctOfListingsWithPriceReductionsSeasAdj_CondoCoop | 9164 | 0.31 | 10.39 | 2.60 | 2.23 | 8.53 | 10.38 | 12.19 | 19.61 | ▁▅▇▃▁ |
| PctOfListingsWithPriceReductionsSeasAdj_SingleFamilyResidence | 8724 | 0.34 | 12.60 | 2.27 | 4.44 | 11.09 | 12.54 | 14.11 | 20.76 | ▁▃▇▃▁ |
| PctOfListingsWithPriceReductions_AllHomes | 8724 | 0.34 | 12.36 | 3.00 | 3.56 | 10.23 | 12.30 | 14.37 | 20.92 | ▁▅▇▅▁ |
| PctOfListingsWithPriceReductions_CondoCoop | 9164 | 0.31 | 10.38 | 3.00 | 1.79 | 8.25 | 10.25 | 12.38 | 22.90 | ▁▇▇▂▁ |
| PctOfListingsWithPriceReductions_SingleFamilyResidence | 8724 | 0.34 | 12.58 | 3.07 | 3.74 | 10.42 | 12.50 | 14.62 | 21.69 | ▁▅▇▅▁ |
| PriceToRentRatio_AllHomes | 8912 | 0.33 | 11.44 | 2.22 | 7.05 | 9.73 | 11.14 | 12.74 | 21.55 | ▆▇▃▁▁ |
| Sale_Counts | 7837 | 0.41 | 7065.63 | 7978.75 | 130.00 | 1672.00 | 4546.00 | 9247.00 | 50275.00 | ▇▂▁▁▁ |
| Sale_Counts_Seas_Adj | 7837 | 0.41 | 7049.04 | 7808.91 | 242.00 | 1713.00 | 4764.00 | 9393.00 | 41779.00 | ▇▂▁▁▁ |
| Sale_Prices | 9218 | 0.30 | 194551.78 | 75755.80 | 83800.00 | 137000.00 | 180900.00 | 235775.00 | 543100.00 | ▇▆▂▁▁ |
| ZHVI_1bedroom | 2607 | 0.80 | 117060.42 | 61056.04 | 30900.00 | 74600.00 | 100400.00 | 142300.00 | 390200.00 | ▇▆▁▁▁ |
| ZHVI_2bedroom | 1467 | 0.89 | 135168.80 | 72267.65 | 32800.00 | 86700.00 | 115400.00 | 166800.00 | 542400.00 | ▇▅▁▁▁ |
| ZHVI_3bedroom | 425 | 0.97 | 167062.80 | 80212.51 | 49600.00 | 116400.00 | 141200.00 | 204400.00 | 639700.00 | ▇▃▁▁▁ |
| ZHVI_4bedroom | 853 | 0.94 | 243829.67 | 106216.76 | 64700.00 | 174900.00 | 218000.00 | 281000.00 | 850400.00 | ▇▆▁▁▁ |
| ZHVI_5BedroomOrMore | 1398 | 0.89 | 323733.49 | 170291.68 | 68600.00 | 217900.00 | 288000.00 | 365300.00 | 1497000.00 | ▇▂▁▁▁ |
| ZHVI_AllHomes | 774 | 0.94 | 169753.41 | 83882.41 | 38200.00 | 114500.00 | 144750.00 | 207600.00 | 620400.00 | ▇▅▁▁▁ |
| ZHVI_BottomTier | 896 | 0.93 | 102669.97 | 49705.35 | 32600.00 | 66600.00 | 87400.00 | 128200.00 | 335600.00 | ▇▅▂▁▁ |
| ZHVI_CondoCoop | 1530 | 0.88 | 156769.89 | 79695.00 | 42200.00 | 111300.00 | 134700.00 | 175800.00 | 782900.00 | ▇▂▁▁▁ |
| ZHVI_MiddleTier | 774 | 0.94 | 169753.41 | 83882.41 | 38200.00 | 114500.00 | 144750.00 | 207600.00 | 620400.00 | ▇▅▁▁▁ |
| ZHVI_SingleFamilyResidence | 774 | 0.94 | 174154.38 | 92243.37 | 37900.00 | 115000.00 | 147300.00 | 211775.00 | 737500.00 | ▇▃▁▁▁ |
| ZHVI_TopTier | 688 | 0.95 | 293973.81 | 147829.06 | 70900.00 | 194700.00 | 251100.00 | 349400.00 | 988100.00 | ▇▆▂▁▁ |
| ZRI_AllHomes | 8958 | 0.32 | 1321.33 | 371.01 | 799.00 | 1047.00 | 1210.00 | 1474.00 | 2690.00 | ▇▇▂▁▁ |
| ZRI_AllHomesPlusMultifamily | 8876 | 0.33 | 1318.06 | 368.78 | 799.00 | 1036.00 | 1210.00 | 1477.00 | 2653.00 | ▇▇▂▁▁ |
| ZriPerSqft_AllHomes | 8876 | 0.33 | 0.93 | 0.28 | 0.56 | 0.73 | 0.86 | 1.07 | 2.29 | ▇▅▁▁▁ |
| Zri_MultiFamilyResidenceRental | 8876 | 0.33 | 1233.05 | 369.43 | 713.00 | 959.75 | 1126.00 | 1399.50 | 2606.00 | ▇▅▃▁▁ |
| Zri_SingleFamilyResidenceRental | 8958 | 0.32 | 1327.52 | 383.31 | 799.00 | 1039.25 | 1220.00 | 1467.75 | 2754.00 | ▇▆▂▁▁ |
# --- 1. Create Date Hierarchy ---
# We parse the Date and extract hierarchy levels
processed_data <- all_states_data %>%
  mutate(
    Year = year(Date),
    Quarter = quarter(Date, with_year = TRUE),
    Month = month(Date, label = TRUE)
  )

# --- 2. Create Bins for a Measure ---
# We bin 'MedianListingPrice_AllHomes' into categories
price_breaks <- c(0, 150000, 200000, 250000, 300000, 500000, Inf)
price_labels <- c(
  "Under $150k", "$150k-$200k", "$200k-$250k",
  "$250k-$300k", "$300k-$500k", "Over $500k"
)

processed_data <- processed_data %>%
  mutate(
    Price_Bin = cut(MedianListingPrice_AllHomes,
                    breaks = price_breaks,
                    labels = price_labels,
                    right = FALSE)  # Bins are [min, max)
  )

# Show the new hierarchical and binned columns
print("Data with new hierarchy and bins:")

[1] "Data with new hierarchy and bins:"
glimpse(processed_data %>%
          select(RegionName, Date, Year, Quarter, Month, MedianListingPrice_AllHomes, Price_Bin))

Rows: 13,212
Columns: 7
$ RegionName <chr> "Alabama", "Arizona", "Arkansas", "Califor…
$ Date <date> 1996-04-30, 1996-04-30, 1996-04-30, 1996-…
$ Year <dbl> 1996, 1996, 1996, 1996, 1996, 1996, 1996, …
$ Quarter <dbl> 1996.2, 1996.2, 1996.2, 1996.2, 1996.2, 19…
$ Month <ord> Apr, Apr, Apr, Apr, Apr, Apr, Apr, Apr, Ap…
$ MedianListingPrice_AllHomes <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Price_Bin <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
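With the date hierarchy and price bins in place, the kind of cube query posed in the Introduction reduces to a short group-and-count (a sketch using the processed_data object built above):

```r
# Dice on Year x Price_Bin: count state-month observations per cell.
cube_counts <- processed_data %>%
  filter(!is.na(Price_Bin)) %>%
  count(Year, Price_Bin, name = "n_obs") %>%
  arrange(Year, Price_Bin)

# Slice: fix one Year to see that year's price-bin distribution.
cube_counts %>% filter(Year == 2017)
```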
4. Explore ZHVI with Date Hierarchy
For the sake of simplicity, we focus for now on a single state (California) and the Zillow Home Value Index (ZHVI_AllHomes) over time.
# Let's analyze the Zillow Home Value Index (ZHVI_AllHomes) for "California".
# all_states_data
cali_zhvi <- processed_data %>%
  filter(RegionName == "California") %>%
  select(Date, ZHVI_AllHomes) %>%
  # Remove any missing values for this metric
  na.omit()

# We set 'Date' as the 'index' (the time component).
cali_ts <- cali_zhvi %>%
  as_tsibble(index = Date)
cali_ts_fill_gaps <- cali_zhvi %>%
  as_tsibble(index = Date) %>%
  fill_gaps()

# If you were analyzing multiple states, you would add a 'key'.
# Example for multiple states (not run here):
# multi_state_ts <- all_states_data %>%
#   select(Date, RegionName, ZHVI_AllHomes) %>%
#   as_tsibble(index = Date, key = RegionName)

Let's see how many NAs there are after filling the gaps:
# Count total NAs per column
colSums(is.na(cali_ts))
colSums(is.na(cali_ts_fill_gaps))

# Inspect the new, focused data frame
head(cali_zhvi)

        Date ZHVI_AllHomes
1 1996-04-30 157900
2 1996-05-31 157800
3 1996-06-30 157500
4 1996-07-31 157300
5 1996-08-31 157000
6 1996-09-30 156800
head(cali_ts)

# A tsibble: 6 x 2 [1D]
Date ZHVI_AllHomes
<date> <int>
1 1996-04-30 157900
2 1996-05-31 157800
3 1996-06-30 157500
4 1996-07-31 157300
5 1996-08-31 157000
6 1996-09-30 156800
head(cali_ts_fill_gaps)

# A tsibble: 6 x 2 [1D]
Date ZHVI_AllHomes
<date> <int>
1 1996-04-30 157900
2 1996-05-01 NA
3 1996-05-02 NA
4 1996-05-03 NA
5 1996-05-04 NA
6 1996-05-05 NA
cali_ts %>%
  autoplot(ZHVI_AllHomes) +
  labs(title = "Zillow Home Value Index (ZHVI) for California",
       y = "Home Value Index",
       x = "Year") +
  theme_minimal()
# This will scan cali_ts and report any gaps
gap_summary <- count_gaps(cali_ts)
# Print the summary
print(gap_summary)

# A tibble: 260 × 3
.from .to .n
<date> <date> <int>
1 1996-05-01 1996-05-30 30
2 1996-06-01 1996-06-29 29
3 1996-07-01 1996-07-30 30
4 1996-08-01 1996-08-30 30
5 1996-09-01 1996-09-29 29
6 1996-10-01 1996-10-30 30
7 1996-11-01 1996-11-29 29
8 1996-12-01 1996-12-30 30
9 1997-01-01 1997-01-30 30
10 1997-02-01 1997-02-27 27
# ℹ 250 more rows
cali_ts_fill_gaps %>%
  autoplot(ZHVI_AllHomes) +
  labs(title = "Zillow Home Value Index (ZHVI) for California",
       y = "Home Value Index",
       x = "Year") +
  theme_minimal()
# Inspect the new, focused data frame
# print("cali_zhvi (data frame):")
# DT::datatable(cali_zhvi, options = list(pageLength = 5))
print("cali_ts (tsibble):")

[1] "cali_ts (tsibble):"

DT::datatable(cali_ts, options = list(pageLength = 5))

# print("cali_ts_fill_gaps (tsibble with filled gaps):")
# DT::datatable(cali_ts_fill_gaps, options = list(pageLength = 5))

To roll the daily-indexed series up to a true monthly series, we:

- Create a new ‘Month’ column using the yearmonth() function
- Group by this new explicit month
- Summarise the data (mean() is safe here; since there is one observation per month, last() or sum() would also work)
- Convert to a tsibble, now indexed by the new ‘Month’ object
cali_ts_monthly <- cali_zhvi %>%
  mutate(Month = yearmonth(Date)) %>%
  group_by(Month) %>%
  summarise(ZHVI_AllHomes = mean(ZHVI_AllHomes, na.rm = TRUE)) %>%
  as_tsibble(index = Month)
# Check the new object
print("New Monthly Tsibble:")

[1] "New Monthly Tsibble:"

head(cali_ts_monthly)

# A tsibble: 6 x 2 [1M]
Month ZHVI_AllHomes
<mth> <dbl>
1 1996 Apr 157900
2 1996 May 157800
3 1996 Jun 157500
4 1996 Jul 157300
5 1996 Aug 157000
6 1996 Sep 156800
# Note the <mth> tag in the output, indicating it's a monthly tsibble
print("Class of the new index:")

[1] "Class of the new index:"

class(cali_ts_monthly$Month)

[1] "yearmonth" "vctrs_vctr"
cali_ts_monthly %>%
  autoplot(ZHVI_AllHomes) +
  labs(title = "Zillow Home Value Index (ZHVI) for California",
       y = "Home Value Index",
       x = "Year") +
  theme_minimal()
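The monthly tsibble can also be rolled up one more level of the date hierarchy (Month to Year) with index_by(), tsibble's analogue of group_by() over the time index (a sketch using the cali_ts_monthly object from above):

```r
# Roll up: annual mean of the monthly ZHVI values.
cali_ts_yearly <- cali_ts_monthly %>%
  index_by(Year = year(Month)) %>%
  summarise(ZHVI_AllHomes = mean(ZHVI_AllHomes, na.rm = TRUE))

head(cali_ts_yearly)
```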
In R, Date objects are stored internally as the number of days since an “origin” date, by default January 1, 1970. For example, the number 9616 is the number of days since 1970-01-01 and corresponds to April 30, 1996, the starting point of the Zillow dataset. A date shows up as a bare number like this if you (or R) accidentally convert a Date object to a plain numeric.
The yearmonth object we created in the last step is different: it prints as “1996 Apr” but internally stores the number of months since the 1970 epoch, a much smaller number (April 1996 is 26 * 12 + 3 = 315 months after January 1970).
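This is easy to verify by stripping the class with as.numeric() (a quick check; yearmonth() comes from the tsibble package, loaded via fpp3):

```r
library(tsibble)  # for yearmonth()

# A Date is the number of days since the 1970-01-01 epoch:
as.numeric(as.Date("1996-04-30"))   # -> 9616
# A yearmonth is the number of months since January 1970
# (April 1996 is 26 * 12 + 3 = 315 months after the epoch):
as.numeric(yearmonth("1996 Apr"))
```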
print("cali_ts_monthly (tsibble with only monthly):")

[1] "cali_ts_monthly (tsibble with only monthly):"

DT::datatable(cali_ts_monthly, options = list(pageLength = 5))

Reflecting on how the ZHVI is calculated, and how it compares to other home value indices such as Case-Shiller:
The Standard & Poor’s CoreLogic Case–Shiller Home Price Indices are repeat-sales house price indices for the United States.
Conceptual Accuracy (What it gets right)
The R code correctly captures the concept of measuring the middle-third of the market:
Focus on the Middle Tier: The core idea of calculating the ZHVI is to use an aggressively trimmed mean—specifically the mean of values between the 35th and 65th percentile. The code successfully identifies these cut-off points (\(\mathbf{q35}\) and \(\mathbf{q65}\)) and calculates the mean of the values within that range. This properly simulates the “robust” nature of the ZHVI, which aims to exclude extreme high-end and low-end sales that might skew a simple median or average.
Major Flaws (What it gets wrong)
- The Data Source is Incorrect (Sales vs. All Homes)
Your Code’s Flaw: The input data, hypothetical_sales, is a list of sales transactions.
Actual ZHVI Method: The ZHVI is calculated using the Zestimate for every single home (over 100 million properties), not just the few that sold in a given month.
- Impact: Indices based on sales (like Case-Shiller) are susceptible to the compositional bias of which homes happen to sell that month. The ZHVI avoids this by including the estimated value (Zestimate) for all homes, providing a much more comprehensive view of the entire market’s value, whether a house sold or not.
- It Calculates a Static Value, Not Appreciation
Your Code’s Flaw: The output, simulated_zhvi, is a single-month dollar value.
Actual ZHVI Method: The ZHVI is an index that tracks the change in home values over time (appreciation). It is a chained index where the monthly value is determined by:
Calculating the weighted mean appreciation of individual Zestimates from the prior month to the current month.
Applying that appreciation rate to the ZHVI level from the prior month.
- Impact: To calculate a correct ZHVI, you would need two separate sets of Zestimates (Time \(t-1\) and Time \(t\)), calculate the percentage change for each home, and then average those changes to determine the appreciation factor for the overall index. The code calculates a robust mean, but it doesn’t show how that mean changes month-over-month to create the index.
Hypothetical Fix for Flaw 2 (Conceptual only)
A conceptually more accurate approach would require comparing two months:

1. Month 1 Zestimates (t-1): Calculate the \(\text{Mean}_{35-65}\) of Zestimates for Month 1.
2. Month 2 Zestimates (t): Calculate the \(\text{Mean}_{35-65}\) of Zestimates for Month 2.
3. Appreciation:

\[\text{Monthly Appreciation} = \frac{\text{Mean}_{35-65}(t)}{\text{Mean}_{35-65}(t-1)} - 1\]

4. ZHVI Calculation: \(\text{ZHVI}(t) = \text{ZHVI}(t-1) \times (1 + \text{Monthly Appreciation})\)
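A minimal sketch of this chained calculation, using hypothetical Zestimate vectors (random numbers standing in for two consecutive months of Zestimates, not real data):

```r
set.seed(1)

# Hypothetical Zestimates for the same 100 homes in two consecutive months.
zest_prev <- rnorm(100, mean = 350000, sd = 75000)            # month t-1
zest_curr <- zest_prev * rnorm(100, mean = 1.005, sd = 0.01)  # month t

# Mean of the values between the 35th and 65th percentiles.
mid_mean <- function(x) {
  q <- quantile(x, c(0.35, 0.65))
  mean(x[x >= q[1] & x <= q[2]])
}

monthly_appreciation <- mid_mean(zest_curr) / mid_mean(zest_prev) - 1

zhvi_prev <- 100                                    # index level at t-1
zhvi_curr <- zhvi_prev * (1 + monthly_appreciation) # chained index at t
print(monthly_appreciation)
print(zhvi_curr)
```

Unlike the simulation below, this tracks month-over-month change rather than a single static level, which is what makes it an index.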
# Note: as discussed above, this simulation is conceptually incomplete
# 1. Create a hypothetical set of 100 home sales.
# We'll use random numbers for this example.
set.seed(42) # Makes our "random" numbers reproducible
hypothetical_sales <- round(rnorm(100, mean = 350000, sd = 75000))
# 2. Find the 35th and 65th percentile values
# These are the "cut-off" points.
q35 <- quantile(hypothetical_sales, 0.35)
q65 <- quantile(hypothetical_sales, 0.65)
# 3. Filter to get only the "middle-tier" homes
# (i.e., homes with a value between the 35th and 65th percentile)
middle_tier_homes <- hypothetical_sales[
  hypothetical_sales >= q35 & hypothetical_sales <= q65
]
# 4. Calculate the "Simulated ZHVI"
# This is the mean of only those middle-tier homes.
simulated_zhvi <- mean(middle_tier_homes)
# --- Print the results ---
print(paste("Total number of hypothetical sales:", length(hypothetical_sales)))

[1] "Total number of hypothetical sales: 100"

print(paste("35th Percentile Value:", q35))

[1] "35th Percentile Value: 328947.5"

print(paste("65th Percentile Value:", q65))

[1] "65th Percentile Value: 389868.05"

print(paste("Number of homes in middle-tier (35th-65th percentile):", length(middle_tier_homes)))

[1] "Number of homes in middle-tier (35th-65th percentile): 30"

print(paste("Simulated ZHVI (Mean of middle-tier):", round(simulated_zhvi, 2)))

[1] "Simulated ZHVI (Mean of middle-tier): 359053.03"
Within this dataset, the ZHVI was most likely derived from the same underlying data that produced the median listing price columns:
- MedianListingPrice_1Bedroom
- MedianListingPrice_2Bedroom
- MedianListingPrice_3Bedroom
- MedianListingPrice_4Bedroom
- MedianListingPrice_5BedroomOrMore
- MedianListingPrice_AllHomes
- MedianListingPrice_CondoCoop
- MedianListingPrice_DuplexTriplex
- MedianListingPrice_SingleFamilyResidence